Abstract: Document analysis plays an important role in office 
automation, especially in intelligent signal processing. The 
proposed system consists of two modules: block segmentation 
and block identification. In this approach, a document is 
first segmented into several non-overlapping blocks by a 
novel recursive segmentation technique, and then the features 
embedded in each segmented block are extracted. 
Two kinds of features, connected components and image 
boundary/perimeter features, are extracted. The features are 
verified to be effective in characterizing document blocks. 
Last, an identification module is developed to determine the 
identity of the considered block. A wide variety of documents 
is used to verify the feasibility of this approach. 

Index Terms: OCR, Segmentation, Connected components 

I. Introduction 

Document image layout analysis is a crucial step in many 
applications related to document images, like text extraction 
using optical character recognition (OCR), reflowing 
documents, and layout-based document retrieval. Layout 
analysis is the process of identifying layout structures by 
analyzing page images. Layout structures can be physical 
(text, graphics, pictures, ...) or logical (titles, paragraphs, 
captions, headings, ...). The identification of physical layout 
structures is called physical or geometric layout analysis, 
while assigning different logical roles to the detected regions 
is termed logical layout analysis. 

The purpose of document-image analysis is to transform 
the information contained on a digitized document image into 
an equivalent symbolic representation. Text parts are the main 
information carriers in most applications. For that purpose, it 
is necessary to locate text objects within the image, recognize 
them, and extract the hidden information. The documents 
may contain, besides text, graphics and images that overlap. 
Since text lines are not always horizontally aligned, finding 
text parts and locating characters, words, and lines are not 
trivial tasks. Due to the tremendous reduction in the storage 
space of the processed results, it is advantageous to 
reproduce, transmit, and store the document in the processed 
form. The extracted regions can then be processed by a 
subsequent step according to their types, e.g. OCR for text 
regions and compression for graphics and halftone 
images. Up to now, several strategies have been tried to solve 
the problem of segmentation. Techniques for page 
segmentation and layout analysis are broadly divided into 
three main categories: top-down, bottom-up and hybrid 
techniques [1]. Many bottom-up approaches are used for 
page segmentation and block identification [5], [11]. Yuan and 
Tan [2] designed a method that makes use of edge information to 
extract textual blocks from gray scale document images. It 
aims at detecting only textual regions in heavily noise-infected 
newspaper images and separating them from graphical regions. 

The White Tiles Approach [3] describes a new approach 
to page segmentation and classification. In this method, 
the white tiles of each region are gathered together, 
their total area is estimated, and the regions are then 
classified as text or images. George Nagy and Mukkai 
Krishnamoorthy [4] have proposed two complementary 
methods for characterizing the spatial structure of digitized 
technical documents and labeling various logical components 
without using optical character recognition. The projection 
profile method [6], [13] is used for separating text and images, 
but is only suitable for Devanagari (Hindi) 
documents. 

The main disadvantage of this method is that irregularly 
shaped images with non-rectangular text blocks may 
result in loss of some text. They can be dealt with by adapting 
algorithms available for Roman script. Kuo-Chin Fan, 
Chi-Hwa Liu and Yuan-Kai Wang [15] have implemented a 
feature-based document analysis system which utilizes domain 
knowledge to segment and classify mixed text/graphics/image 
documents. This method is only suitable for pure text or pure 
image documents, i.e. documents which have only text regions 
or only image regions. It is good for text-image identification 
but not for extraction. 

The Constrained Run-Length Algorithm (CRLA) [16] is a 
well-known technique for page segmentation. The algorithm 
is very efficient for partitioning documents with Manhattan 
layouts but is not suited to complex page layouts, 
e.g. irregular graphics embedded in a text paragraph. Its main 
drawback is the use of only local information during the 
smearing stage, which may lead to erroneous linkage of text 
and graphics. Kuo-Chin Fan, Liang-Shen Wang and Yuan-Kai 
Wang [17] proposed an intelligent document analysis system 
to achieve the document segmentation and identification 
goal. The proposed system consists of two modules: block 
segmentation and block identification. Two kinds of features, 
connectivity histogram and multiresolution features, are 
extracted. 
II. Methodology 

A. Document Segmentation 

The main aim of document segmentation is to segment a 
document into several separate blocks with each block 
representing one type of medium. The method we adopted for 
segmenting documents is a combination of the run-length 
smearing algorithm and a boundary/perimeter detection 
procedure. It first performs the smearing operation in both 
the horizontal and vertical directions to generate blocks. Due 
to the selection of an improper threshold in the run-length 
smearing process, an intact paragraph might be segmented 
into several consecutive horizontal stripes with each stripe 
representing one horizontal text line. A stripe merging 
procedure is thus devised to merge those text stripes that 
belong to the same paragraph. The merit is that it generates 
an intact text block instead of several smaller text stripes. In 
this way, it not only saves storage space but also reduces 
the burden of performing optical character recognition. 

B. Pre-processing 

Preprocessing applies mature image processing techniques 
to improve the quality of document images. Its purpose is 
to enhance the images and extract useful information from 
them for later processing. Two 
preprocessing tasks, thresholding and noise removal, are 
performed here. 
Figure 1. Original color image and Binarized image. 
Figure 2. Image after horizontal run length smearing 

A document image is usually captured as a color (RGB) 
image, and we convert it into a gray-level image. Hence, 
a binarization procedure converting a gray-level image into a 
binary image is necessary. Otsu thresholding is utilized to 
accomplish the conversion task. Figure 1 shows the original 
and binarized images. 
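
As an illustration of the thresholding step, the following Python 
sketch converts an RGB image to gray level and binarizes it with 
Otsu's method. The function names, the luminance weights and the 
convention that dark ink maps to 1 are assumptions made for this 
example; the paper does not specify them. 

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level t that maximizes the between-class variance,
    where class 0 holds levels <= t and class 1 holds levels > t."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    omega = np.cumsum(prob)             # probability of class 0 for each t
    mu = np.cumsum(prob * levels)       # cumulative mean up to each t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0           # thresholds that leave one class empty
    return int(np.argmax(sigma_b))

def binarize(rgb):
    """Gray-level conversion followed by Otsu thresholding."""
    # Assumed luminance weights; any standard RGB-to-gray mapping would do.
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    gray = np.rint(gray).astype(np.uint8)
    t = otsu_threshold(gray)
    return (gray <= t).astype(np.uint8)  # assumed: dark (ink) pixels become 1
```

The output uses 1 for dark pixels so that it matches the 0/1 
convention assumed for the smearing step. 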

C. Block Extraction 

With pre-processing done, a binarized image is 
obtained. In this binarized image, each meaningful block 
which can be easily recognized by human beings is composed 
of pixels. The run-length smearing algorithm (RLSA) connects 
two nonadjacent runs into one merged run if the distance 
between these two runs is smaller than a threshold. 

It is assumed that white pixels are represented by 0 and black 
pixels by 1. If the number of 0's between two 1's is less than or 
equal to a constant C, then those 0's are replaced by 1's. In 
other words, two runs separated by a distance no greater than 
the threshold C will be merged into one run. Example: 

Before smearing: 

1111111000001111111100011 

After smearing: 

1111111000001111111111111 

The result generated by the smearing operation is several runs 
of 1's in each horizontal row. 
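
The smearing rule for a single row can be sketched in Python as 
follows. The function names are hypothetical, and the value C = 4 
is an assumption chosen only because it is consistent with the 
example above, where the gap of three 0's is filled and the gap of 
five 0's is not. 

```python
def rlsa_row(bits, c):
    """Replace each run of 0's lying between two 1's by 1's
    when the run is at most c pixels long."""
    out = list(bits)
    last_one = None                      # index of the most recent 1
    for i, b in enumerate(bits):
        if b == 1:
            if last_one is not None:
                gap = i - last_one - 1   # number of 0's between the two 1's
                if 0 < gap <= c:
                    for j in range(last_one + 1, i):
                        out[j] = 1
            last_one = i
    return out

def smear_string(s, c):
    """Convenience wrapper working on strings of '0' and '1'."""
    return "".join(str(b) for b in rlsa_row([int(ch) for ch in s], c))

# The row from the example above, smeared with the assumed C = 4.
print(smear_string("1111111000001111111100011", 4))
```

Leading and trailing 0's are left untouched because they are not 
enclosed by two 1's. 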

Two constants, CHT (constant horizontal threshold 
value) and CVT (constant vertical threshold value), are needed 
because RLSA is applied in two directions. CHT is 
selected as half of CVT. Figures 2 and 3 illustrate the operation 
of RLSA in the horizontal and vertical directions, respectively. 
The results generated by the horizontal and vertical smearing 
operations are then combined by a logical AND operation to 
produce the meaningful blocks. In other words, a pixel in the 
resulting image is black if the pixels at the same position in 
both the vertical and horizontal run-length smearing images are 
black. Though run-length smearing can combine related pixels 
into meaningful blocks, there are still some small blocks which 
should be further combined. We repeatedly apply run-length 
smearing with increased run lengths to the output of the AND 
operation to merge these blocks. An image illustrating the 
result generated by RLSA is shown in Figure 4. 
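
A minimal Python sketch of the two-direction smearing and the 
logical AND combination is given below. It assumes a binary image 
stored as a NumPy array with 1 for black pixels; the function names 
and the way the thresholds are passed are illustrative choices, not 
the authors' implementation. 

```python
import numpy as np

def smear_1d(line, c):
    """Fill runs of at most c zeros that lie between two ones."""
    out = line.copy()
    ones = np.flatnonzero(line)
    for a, b in zip(ones[:-1], ones[1:]):
        if 1 < b - a <= c + 1:           # gap of (b - a - 1) zeros, at most c
            out[a + 1:b] = 1
    return out

def rlsa_blocks(binary, cht, cvt):
    """Horizontal and vertical smearing combined by a logical AND."""
    horiz = np.array([smear_1d(row, cht) for row in binary])
    vert = np.array([smear_1d(col, cvt) for col in binary.T]).T
    return horiz & vert

def rlsa_with_half_threshold(binary, cvt):
    # The text selects CHT as half of CVT; integer division is an assumption.
    return rlsa_blocks(binary, cvt // 2, cvt)
```

Applying `rlsa_blocks` again to its own output with larger 
thresholds corresponds to the repeated smearing described above. 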
D. Boundary/Perimeter Detection 

Since we have obtained meaningful blocks, the next step 
is to find the outside boundary of the smeared blocks. The outside 
boundary is defined as the white pixels surrounding the 
extracted boundary of a block. Traditional edge detection 
methods are not adequate for detecting this kind of boundary. 
For example, if traditional edge detection methods are 
applied to a one-pixel-wide line block, an open contour will be 
generated. 
Figure 3. Image after vertical run length smearing 
Figure 4. Final Result of Run length smearing 
Figure 5. Image after reapplying horizontal run length smearing 
Figure 6. Four possible search directions. 

We use 4-connectivity as well as 8-connectivity for boundary 
extraction. It is observed that 4-connectivity gives better 
results. Figure 6 shows the four possible search directions. A 
pixel is considered to be part of the perimeter if it is nonzero 
and it is connected to at least one zero-valued pixel. 

Connectivity can also be defined in a more general way 
for any dimension by specifying it as a 3-by-3 matrix 
of 0's and 1's. The 1-valued elements define neighborhood 
locations relative to the center element of the matrix. Note 
that the connectivity matrix must be symmetric about its center 
element. Figure 7 shows the image after boundary detection. 
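
The perimeter rule with 4-connectivity can be sketched in Python 
as follows: a nonzero pixel is kept when at least one of its four 
neighbors (up, down, left, right) is zero. Treating pixels outside 
the image as zero is an assumption of this sketch, as is the 
function name. 

```python
import numpy as np

def perimeter_4(binary):
    """Return a 0/1 image marking the nonzero pixels that touch
    at least one zero pixel among their four neighbors."""
    img = binary.astype(bool)
    # Assumed border handling: everything outside the image is zero.
    padded = np.pad(img, 1, constant_values=False)
    up = padded[:-2, 1:-1]
    down = padded[2:, 1:-1]
    left = padded[1:-1, :-2]
    right = padded[1:-1, 2:]
    interior = up & down & left & right  # all four neighbors are nonzero
    return (img & ~interior).astype(np.uint8)
```

For a filled rectangular block this leaves a one-pixel-wide closed 
contour and clears the interior. 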

E. Connected Component and Area Computation 

After detecting the boundary in the image, we find the 
connected components. Then, with the help of the connected 
components, we compute the area of each closed loop. From Figure 
6 it is clear that the text which is converted into uniform 
stripes has uniform area. When we compare the area of a 
closed loop (the stripes formed by run length smearing) with 
the open loop area, the smaller area is filled with zeros. Figure 8 
shows the smaller area which is filled with zero or black pixels. 
Figure 7. Image after Boundary Detection. 

To find the connected components, the algorithm uses the 
following general procedure: 

1. Scan all image pixels, assigning preliminary labels to nonzero 
pixels and recording label equivalences in a union-find table. 

2. Resolve the equivalence classes using the union-find 
algorithm. 

3. Relabel the pixels based on the resolved equivalence 
classes. 
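
The three steps above can be sketched in Python as a two-pass 
labeling with a union-find table. The sketch assumes 4-connectivity 
and a binary NumPy image; the function names are hypothetical, and 
the per-component pixel count is included only to illustrate the 
area computation. 

```python
import numpy as np

def find(parent, x):
    """Return the representative label of x (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def label_components(binary):
    """Two-pass 4-connected labeling. Returns (labels, areas)."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    parent = [0]                               # entry 0 is the background
    # Step 1: preliminary labels and label equivalences.
    for y in range(h):
        for x in range(w):
            if not binary[y, x]:
                continue
            up = int(labels[y - 1, x]) if y > 0 else 0
            left = int(labels[y, x - 1]) if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))     # new label is its own root
                labels[y, x] = len(parent) - 1
            else:
                labels[y, x] = min(l for l in (up, left) if l)
                if up and left:
                    ru, rl = find(parent, up), find(parent, left)
                    parent[max(ru, rl)] = min(ru, rl)
    # Steps 2 and 3: resolve equivalences and relabel as 1..n.
    remap = {}
    for y in range(h):
        for x in range(w):
            if labels[y, x]:
                root = find(parent, int(labels[y, x]))
                labels[y, x] = remap.setdefault(root, len(remap) + 1)
    areas = np.bincount(labels.ravel())[1:]    # pixel count per component
    return labels, areas
```

The returned areas can then be compared with one another to decide 
which regions are filled with zeros. 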



©2012 ACEEE 
DOL01.USIP.03.01.70 






ACEEE Int. J. on Signal & Image Processing, Vol. 03, No. 01, Jan 2012 



F. Image-Text Separation 

The text areas are thus filled with zero or black pixels. By 
ORing the original binary image with the image so obtained, the 
entire text is separated from the original image. This is shown in 
Figure 9. After separation of the text, we XOR the image in Figure 
9 with the image in Figure 3 to separate the image/graphics 
from the original binary document. If some large-sized text is still 
associated with the separated image, we repeat the run length 
smearing operation on the document image. The separated 
image is shown in Figure 10. Results for two more samples 
are given in Figure 11 and Figure 12. Text and image parts are 
clearly separated. 
Figure 8. Smaller Area filled with Zero smearing 
Figure 9. Text Extracted from given Document smearing. 




Conclusions 

The choice of segmentation technique is a very important step in 
the separation of newspaper and magazine document images. 
This choice is based on the quality of the input image, the 
required output quality and the speed of processing. The run 
length smearing algorithm using boundary detection techniques is 
found to perform well for images having good 
separation between the text and image pixels. The run length 
smearing algorithm and the recursive segmentation technique 
are adopted in the segmentation stage to extract 
non-overlapping blocks embedded in the document. 

The Enhanced CRLA has been chosen to build a number 
of document processing systems because of its advantages 
compared with other page segmentation techniques. For 
example, although texture-analysis-based approaches 
are powerful enough to handle various page layouts, they are 
commonly time-consuming because of the pixel-level 
classification. 

The methods based on connected component analysis 
may have problems extracting large headline characters and 
are thus suited only to documents of a certain character 
size. However, the original CRLA can only treat the Manhattan 
page format; the present work extends its capability to cover 
both Manhattan and non-Manhattan layouts and thus 
expands its application domain significantly. But the range 
of the mean length of horizontal black runs (MBRL) and the 
white-black transition count per unit width (MTC) in CRLA is not 
suitable for all types of document images. 

To improve the performance of these techniques, our 
method is implemented; it works well for all types of 
document images and on text in several languages, e.g. 
Hindi, English, Bangla, Telugu and Chinese. The method is 
language independent. The projection profile 
method [1], [13], in contrast, is only applicable to Devanagari 
document images. It requires that the layout of the newspaper 
document image be very specific; otherwise it will not separate 
the text and image regions. Run length smearing using the stripe 
merging method is only suitable when the stripes formed during 
run length smearing are perfect rectangles. 

In our text-image separation system using boundary 
detection, there is no need for skew correction; the system 
works properly without it. Also, there is no requirement 
to label different sizes of text and images in the given 
documents. It can easily separate the text and image from a 
document image. The method can be used recursively if the 
separated image contains some residual text. This system 
provides a good platform for implementing segmentation 
techniques suitable for developing real time processing 
techniques and processing large databases of newspaper 
and magazine document images. 



Figure 10. Image Extracted from given Document Image 